
# Replication Instructions
*(As submitted to Foreign Policy Analysis Dataverse)*

## Step-by-step procedure

1. **Corpus**: 1,974 transcripts → kokkai.ndl.go.jp (permanent links provided in `output_with_data_u.xlsx`).
2. **Tokenization**: MeCab + `unidic-lite` + Japanese WordNet (`wnjpn.db`) for synonym normalization.
3. **Filtering**: POS = noun / verb / adjective / adverb; remove stopwords; remove numbers and punctuation (as specified in the script).
4. **Co-occurrence**: 5-word sliding window → weighted edge list (top 100 terms).
5. **Network analysis**: Gephi 0.10.1 → ForceAtlas2 → modularity (Blondel et al., 2008).
6. **LDA**: Gensim (`gensim.models.LdaModel`), 4 topics, `passes=15`.
7. **Permutation test**: 10,000 resamples.
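
Step 4 can be illustrated with a minimal sketch. It is not the replication script itself: the function name `cooccurrence_edges`, the toy documents, and the exact window convention (each token is paired with the following `window - 1` tokens, and pairs are counted as undirected) are assumptions made for illustration.

```python
from collections import Counter

def cooccurrence_edges(docs, window=5, top_n=100):
    """Return {(term_a, term_b): weight} restricted to the top_n most frequent terms."""
    # Rank terms by corpus frequency and keep only the top_n.
    freq = Counter(tok for doc in docs for tok in doc)
    keep = {term for term, _ in freq.most_common(top_n)}

    edges = Counter()
    for doc in docs:
        for i, left in enumerate(doc):
            # Pair each token with the tokens that follow it inside the window.
            for right in doc[i + 1 : i + window]:
                if left != right and left in keep and right in keep:
                    # Sort the pair so that (a, b) and (b, a) share one edge.
                    edges[tuple(sorted((left, right)))] += 1
    return edges

# Toy tokenized documents (assumption): stand-ins for the preprocessed transcripts.
docs = [
    ["安全", "保障", "政策", "外交", "協力", "安全"],
    ["外交", "政策", "協力"],
]
edges = cooccurrence_edges(docs, window=5, top_n=100)
```

Writing the resulting pairs and weights to a CSV file with `Source`, `Target`, and `Weight` columns would give an edge list in a form that Gephi can import.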

---

## LDA model and reproducibility

This study uses Latent Dirichlet Allocation (LDA) for topic modeling. LDA is a probabilistic model whose parameter estimation relies on random initialization and iterative optimization. Consequently, repeated runs on the same data with the same hyperparameters may produce slightly different numerical results. These differences are inherent to the stochastic estimation of LDA and do not affect the substantive conclusions of the analysis. To keep every other part of the pipeline fixed, the code implements the following measures:

### 1. Fixed model implementation and hyperparameters

The topic model is always estimated using the same implementation (`gensim.models.LdaModel`) with a fixed number of topics and passes. In the replication script, we set `optimal_num_topics = 4` and specify `passes=15` when calling:

```python
optimal_model = models.LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=optimal_num_topics,
    passes=15
)
```

### 2. Deterministic text preprocessing pipeline

All documents are read from the same input file and processed by a fully specified pipeline:

1. Japanese tokenization using MeCab with the `unidic-lite` dictionary (Python module `unidic_lite`);
2. Lemmatization based on MeCab’s morphological features;
3. Synonym normalization using the Japanese WordNet database `wnjpn.db` (downloaded from a fixed URL in the script) and the precomputed `word_to_synonyms` mapping;
4. Stopword removal using a fixed stopword list loaded from `stopwords.csv`;
5. Part-of-speech filtering to keep only nouns, verbs, adjectives, and adverbs, and exclusion of numbers and punctuation.

This logic is implemented in the `tokenize_and_normalize` function and applied as:

```python
data['tokens'] = data['text'].apply(tokenize_and_normalize)
```

### 3. Consistent construction of the corpus and dictionary

The topic model is always trained on the full set of documents; we do not perform any random subsampling or random train–test splits. The dictionary and bag-of-words corpus are constructed in a deterministic way via:

```python
dictionary = corpora.Dictionary(data['tokens'])
corpus = [dictionary.doc2bow(tokens) for tokens in data['tokens']]
```

so that, for a given preprocessed dataset, the input to LDA is uniquely determined.

### 4. Deterministic computation of monthly topic distributions and figures

After estimating the model, we compute document–topic distributions with:

```python
doc_topics = optimal_model.get_document_topics(corpus, minimum_probability=0)
```

and aggregate them by calendar month, deriving the month from the `date` column via `data['month'] = data['date'].dt.to_period('M')`. Monthly averages and the corresponding time-series plots (topic distribution over time) are computed with explicit, deterministic Pandas and Matplotlib code that has no further stochastic components.
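
The monthly aggregation can be sketched as follows. This is a simplified illustration rather than the script's exact code: the toy `data` frame, the hard-coded `doc_topics` values, and the `topic_k` column names are assumptions, with `doc_topics` mimicking the list-of-`(topic_id, probability)` shape that `get_document_topics` yields per document.

```python
import pandas as pd

# Toy inputs (assumptions): three dated documents and their topic distributions.
data = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-05", "2020-01-20", "2020-02-03"]),
})
doc_topics = [
    [(0, 0.7), (1, 0.3)],
    [(0, 0.5), (1, 0.5)],
    [(0, 0.2), (1, 0.8)],
]

# One row per document, one column per topic.
topic_df = pd.DataFrame([dict(doc) for doc in doc_topics]).sort_index(axis=1)
topic_df.columns = [f"topic_{k}" for k in topic_df.columns]

# Derive the calendar month and average topic shares within each month.
data["month"] = data["date"].dt.to_period("M")
monthly = topic_df.groupby(data["month"]).mean()
```

Calling `monthly.plot()` on a frame like this would draw one line per topic over time.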

In addition, the script generates an interactive HTML visualization of the fitted topic model using `pyLDAvis`:

```python
lda_vis = gensimvis.prepare(optimal_model, corpus, dictionary)
pyLDAvis.save_html(lda_vis, f'/content/lda_vis_{optimal_num_topics}topics_modified.html')
```

The resulting HTML file (e.g., `lda_vis_4topics_modified.html`) is also included in the Dataverse replication dataset.

---

## Summary

In summary, the only source of variability across runs is the stochastic optimization procedure inherent to LDA itself. Under the fixed preprocessing pipeline, hyperparameter settings, and aggregation steps described above, repeated runs may show small numerical differences, but the semantic content of the topics and the overall temporal patterns reported in the paper remain stable. We also provide a screenshot of the original Google Colab output underlying Figure 5 in the paper as an example of a typical run of the model. Because LDA estimation is stochastic, this screenshot illustrates the representative pattern of results; it is not a bit-for-bit target that must be reproduced exactly.
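
One way a replicator might check topic stability across two runs is to compare the top words of each topic. The sketch below is a hypothetical check, not part of the replication script: the helper names and the English toy word lists are assumptions, and the matching is a simple best-overlap search rather than a one-to-one assignment. In practice the word lists could be taken from each fitted model with `show_topic`.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of words."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def match_topics(run_a, run_b):
    """For each topic in run_a, return (index of best match in run_b, overlap)."""
    matches = []
    for words_a in run_a:
        scores = [jaccard(words_a, words_b) for words_b in run_b]
        best = max(range(len(scores)), key=scores.__getitem__)
        matches.append((best, scores[best]))
    return matches

# Toy top-word lists (assumption): two runs whose topics appear in different order.
run_a = [["security", "alliance", "defense"], ["trade", "tariff", "export"]]
run_b = [["export", "trade", "tariff"], ["defense", "security", "treaty"]]
matches = match_topics(run_a, run_b)
```

Because topic indices are arbitrary in LDA, a topic labeled 0 in one run may appear under a different index in another; matching by word overlap accounts for that relabeling.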
